The task in this project is to automatically predict job categories from job descriptions. We start by preprocessing the descriptions and category labels of the job data, and then build a simple model to predict the categories.
In this task we will be using the following libraries:
import pandas as pd
from scipy.sparse import coo_matrix, vstack
from sklearn.preprocessing import MultiLabelBinarizer
import lightgbm as lgb
import scipy
import numpy as np
import nltk, re
nltk.download('stopwords') # load english stopwords
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
import warnings
warnings.simplefilter("ignore")
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tajalahluwalia/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
dataset = pd.read_csv('Train_rev1.csv', error_bad_lines=False, engine="python")
# dataset = dataset.sample(10000)
# split e.g. 'Engineering Jobs' into the tag list ['Engineering', 'Jobs']
dataset['Category'] = dataset.Category.str.split(' ').tolist()
dataset = dataset[['FullDescription', 'Category']]
# 67/33% random split of dataset
X_train, X_test, y_train, y_test = train_test_split(dataset['FullDescription'].values,
dataset['Category'].values,
test_size=0.33,
random_state=42)
dataset
| | FullDescription | Category |
|---|---|---|
| 0 | Engineering Systems Analyst Dorking Surrey Sal... | [Engineering, Jobs] |
| 1 | Stress Engineer Glasgow Salary **** to **** We... | [Engineering, Jobs] |
| 2 | Mathematical Modeller / Simulation Analyst / O... | [Engineering, Jobs] |
| 3 | Engineering Systems Analyst / Mathematical Mod... | [Engineering, Jobs] |
| 4 | Pioneer, Miser Engineering Systems Analyst Do... | [Engineering, Jobs] |
| ... | ... | ... |
| 244763 | Position: Qualified Teacher Subject/Specialism... | [Teaching, Jobs] |
| 244764 | Position: Qualified Teacher or NQT Subject/Spe... | [Teaching, Jobs] |
| 244765 | Position: Qualified Teacher Subject/Specialism... | [Teaching, Jobs] |
| 244766 | Position: Qualified Teacher Subject/Specialism... | [Teaching, Jobs] |
| 244767 | This entrepreneurial and growing private equit... | [Teaching, Jobs] |
244768 rows × 2 columns
One of the best-known challenges of working with natural-language data is that it is unstructured. If we use it "as is" and extract tokens by splitting the text on whitespace, we will see a lot of "strange" tokens. To avoid these issues, it is usually a good idea to prepare the data first.
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))  # set membership is O(1)
def text_prepare(text, join_symbol):
    """
    text: a string
    return: modified initial string
    """
    text = str(text)
    # replace REPLACE_BY_SPACE_RE symbols with a space
    text = re.sub(REPLACE_BY_SPACE_RE, " ", text)
    # lowercase text
    text = text.lower()
    # delete symbols which are in BAD_SYMBOLS_RE from text
    text = re.sub(BAD_SYMBOLS_RE, "", text)
    # collapse repeated whitespace
    text = re.sub(r'\s+', " ", text)
    # delete stopwords from text and re-join with join_symbol
    text = f'{join_symbol}'.join([i for i in text.split() if i not in STOPWORDS])
    return text
# test if text_prepare works
tests = ["Clean this $%$^^$ 1213 data!!"]
for test in tests: print(text_prepare(test,' '))
clean 1213 data
We can now use the function text_prepare to preprocess the data, ensuring that it does not include any invalid symbols.
X_train = [text_prepare(x,' ') for x in X_train]
X_test = [text_prepare(x,' ') for x in X_test]
y_train = [text_prepare(x,' ') for x in y_train]
y_test = [text_prepare(x,' ') for x in y_test]
from collections import Counter
from itertools import chain
# Dictionary of all tags from train corpus with their counts.
tags_counts = Counter(chain.from_iterable([i.split(" ") for i in y_train]))
# Dictionary of all words from train corpus with their counts.
words_counts = Counter(chain.from_iterable([i.split(" ") for i in X_train]))
top_3_most_common_tags = sorted(tags_counts.items(), key=lambda x: x[1], reverse=True)[:3]
top_3_most_common_words = sorted(words_counts.items(), key=lambda x: x[1], reverse=True)[:3]
print(f"Top three most popular Category tags are: {','.join(tag for tag, _ in top_3_most_common_tags)}")
print(f"Top three most popular Description words are: {','.join(word for word, _ in top_3_most_common_words)}")
Top three most popular Category tags are: jobs,engineering,accounting
Top three most popular Description words are: experience,role,work
We can't use the provided text data "as is" since machine learning algorithms work on numeric data. Text data can be converted into numeric vectors in a variety of ways. We'll try to employ two of them in this article.
A bag-of-words representation is one of the best-known techniques. The transformation works in three steps: find the N most frequent words in the train corpus and assign each one an index; for every text, create a zero vector of dimension N; set 1 at the position of every word that occurs in the text.
The described encoding is implemented below with a dictionary size of 10,000 (matching the shapes printed further down). We use the train data to find the most common terms.
# We consider only the top 10,000 words; this parameter can be fine-tuned
DICT_SIZE = 10000
MOST_COMMON = sorted(words_counts.items(), key=lambda x: x[1], reverse=True)[:DICT_SIZE]
WORDS_TO_INDEX = {word: i for i, (word, _) in enumerate(MOST_COMMON)}
INDEX_TO_WORDS = {i: word for i, (word, _) in enumerate(MOST_COMMON)}
ALL_WORDS = WORDS_TO_INDEX.keys()
def my_bag_of_words(text, words_to_index, dict_size):
    """
    text: a string
    words_to_index: mapping word -> column index
    dict_size: size of the dictionary
    return: a vector which is a bag-of-words representation of 'text'
    """
    result_vector = np.zeros(dict_size)
    # set 1 at the index of every known word in the text
    keys = [words_to_index[i] for i in text.split(" ") if i in words_to_index]
    result_vector[keys] = 1
    return result_vector
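As a quick standalone sanity check of this encoding, here is a miniature restatement of the function with a toy vocabulary (the vocabulary and text are made up):

```python
import numpy as np

def bow_vector(text, words_to_index, dict_size):
    # zero vector with a 1 at the index of every known word
    v = np.zeros(dict_size)
    for w in text.split():
        if w in words_to_index:
            v[words_to_index[w]] = 1
    return v

toy_index = {'hi': 0, 'you': 1, 'me': 2, 'are': 3}  # toy vocabulary
print(bow_vector('hi how are you and you', toy_index, 4))  # [1. 1. 0. 1.]
```

Note that unknown words ('how', 'and') are simply dropped, and repeated words ('you') still produce a single 1, since this is a binary bag-of-words.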
We now apply the implemented function to all samples. To store the result efficiently, we convert the data to a sparse representation. There are several kinds of sparse formats, but scikit-learn estimators work best with the CSR matrix, so we'll use that.
X_train_mybag = scipy.sparse.vstack([scipy.sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_train])
X_test_mybag = scipy.sparse.vstack([scipy.sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_test])
print('X_train shape ', X_train_mybag.shape)
print('X_test shape ', X_test_mybag.shape)
X_train shape  (163994, 10000)
X_test shape  (80774, 10000)
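To see why the sparse representation pays off, here is a small sketch (with arbitrary shapes, not the real data) comparing dense and CSR storage for a matrix with roughly one nonzero per row:

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.RandomState(0)
dense = np.zeros((1000, 5000))
dense[np.arange(1000), rng.randint(0, 5000, size=1000)] = 1  # ~1 nonzero per row
sparse = csr_matrix(dense)

dense_bytes = dense.nbytes
# CSR stores only the nonzero values plus their column indices and row pointers
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
print(dense_bytes, sparse_bytes)
```

For our 163,994 x 10,000 matrix the dense version alone would need over 10 GB, so the sparse form is not optional.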
The second method builds on the bag-of-words framework by also accounting for how often words occur across the whole corpus. It down-weights words that appear in almost every document and yields a richer feature space.
To train a vectorizer, we use scikit-learn's TfidfVectorizer on our train corpus. Don't forget to investigate the arguments you can pass to it. Words that are too rare (occur in fewer than 5 documents) and too common (occur in more than 90 percent of the documents) are filtered out, and both unigrams and bigrams are included in the vocabulary.
from sklearn.feature_extraction.text import TfidfVectorizer
def tfidf_features(X_train, X_test):
    """
    X_train, X_test — samples
    return: TF-IDF representation of each sample and the fitted vocabulary
    """
    # Create a TF-IDF vectorizer with a proper choice of parameters,
    # fit it on the train set, then transform both the train and test sets.
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_df=0.9, min_df=5,
                                       token_pattern=r'(\S+)')
    tfidf_vectorizer.fit(X_train)
    X_train = tfidf_vectorizer.transform(X_train)
    X_test = tfidf_vectorizer.transform(X_test)
    return X_train, X_test, tfidf_vectorizer.vocabulary_
X_train_tfidf, X_test_tfidf, tfidf_vocab = tfidf_features(X_train, X_test)
tfidf_reversed_vocab = {i:word for word,i in tfidf_vocab.items()}
print("manager" in set(tfidf_reversed_vocab.values()))
print("engineer" in set(tfidf_reversed_vocab.values()))
True
True
As we've seen before, each sample in this task can carry multiple category tags. We must convert the labels to binary indicator form, so that a prediction is a mask of 0s and 1s. MultiLabelBinarizer from sklearn does exactly this.
Before passing the labels to the MultiLabelBinarizer, we first convert each label string to a set of tags.
# transform each label string to a set of tags
y_train = [set(i.split(' ')) for i in y_train]
y_test = [set(i.split(' ')) for i in y_test]
y_train[12]
{'accounting', 'finance', 'jobs'}
# fit the transformer on the train labels, then transform both splits
mlb = MultiLabelBinarizer()
y_train = mlb.fit_transform(y_train)
y_test = mlb.transform(y_test)
y_train[0]
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0])
mlb.classes_
array(['accounting', 'admin', 'advertising', 'catering', 'charity',
'cleaning', 'construction', 'consultancy', 'creative', 'customer',
'design', 'domestic', 'energy', 'engineering', 'finance', 'gas',
'general', 'graduate', 'healthcare', 'help', 'hospitality', 'hr',
'jobs', 'legal', 'logistics', 'maintenance', 'manufacturing',
'marketing', 'nursing', 'oil', 'part', 'pr', 'property', 'qa',
'recruitment', 'retail', 'sales', 'scientific', 'services',
'social', 'teaching', 'time', 'trade', 'travel', 'voluntary',
'warehouse', 'work'], dtype=object)
We adopt the One-vs-Rest technique in this task, implemented in the OneVsRestClassifier class: k binary classifiers (one per tag) are trained.
It is one of the most basic strategies, but it often suffices for text categorization tasks.
Because there are so many classifiers to train, it may take some time.
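The mechanics can be seen on synthetic data (shapes and values are arbitrary): OneVsRestClassifier fits one independent binary estimator per tag column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.RandomState(0)
X = rng.rand(20, 4)                 # 20 samples, 4 features
y = np.zeros((20, 3), dtype=int)    # 3 binary tags per sample
y[:10, 0] = 1
y[5:15, 1] = 1
y[::2, 2] = 1

clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)
print(len(clf.estimators_))  # 3, one binary classifier per tag
```

Each of the k estimators answers "does this sample carry tag i or not?", which is why training time scales with the number of tags.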
# For multiclass classification
from sklearn.multiclass import OneVsRestClassifier
# Models
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from lightgbm import LGBMClassifier
def train_classifier(X_train, y_train, C=1.0, model='lr'):
    """
    X_train, y_train — training data
    C — inverse regularization strength
    model — one of 'lr', 'svm', 'nbayes', 'lda'
    return: trained one-vs-rest classifier
    """
    if model == 'lr':
        base = LogisticRegression(C=C, penalty='l1', dual=False, solver='liblinear')
    elif model == 'svm':
        base = LinearSVC(C=C, penalty='l1', dual=False, loss='squared_hinge')
    elif model == 'nbayes':
        base = MultinomialNB(alpha=1.0)
    elif model == 'lda':
        base = LinearDiscriminantAnalysis(solver='svd')
    else:
        raise ValueError(f"unknown model: {model}")
    clf = OneVsRestClassifier(base)
    clf.fit(X_train, y_train)
    return clf
# Train the classifiers for different data transformations: bag-of-words and tf-idf.
# Linear NLP model using bag of words approach
%time classifier_mybag = train_classifier(X_train_mybag, y_train, C=1.0, model='lr')
# Linear NLP model using TF-IDF approach
%time classifier_tfidf = train_classifier(X_train_tfidf, y_train, C=1.0, model='lr')
CPU times: user 7min 51s, sys: 9 s, total: 8min
Wall time: 8min 2s
CPU times: user 5min 28s, sys: 22.6 s, total: 5min 51s
Wall time: 5min 56s
y_test_predicted_labels_mybag = classifier_mybag.predict(X_test_mybag)
y_test_predicted_labels_tfidf = classifier_tfidf.predict(X_test_tfidf)
y_test_pred_inversed = mlb.inverse_transform(y_test_predicted_labels_tfidf)
y_test_inversed = mlb.inverse_transform(y_test)
for i in range(3):
print('Title:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
X_test[i],
','.join(y_test_inversed[i]),
','.join(y_test_pred_inversed[i])
))
Title: business account manager plumbing heating products basic salary south west company currently recruiting due internal promotion wellknown construction group history stretching back turnover excess 300 million selling plumbing heating products plumbing heating contractors dealing every sector smes 1man bands right larger contractors role even split account management new business development inherit excellent accounts sales career opportunities within group excellent person following skills field sales track record sold construction industrial product seek someone excellent organisational account development skills receive full product training package basic salary 30 bonuses fully expensed company car mobile pension laptop south west wales avon bristol bath somerset dorset gloucestershire cornwall wiltshire glamorgan carmarthenshire cardiff swansea bms leading consultancy specialising sales recruitment established 1990 bms achieved truly nationwide presence number regional centres south west operation established 1999 introduced service needs candidates clients alike throughout south west wales offering sales jobs trainees sales representatives sales executives sales engineers area sales managers territory managers account managers opportunities available every corner uk initial meetings occur convenient location bristol m4 m32 within easy reach m5 committed meeting potentially suitable candidates face face furthermore organisation consists several highly focused teams aimed specific market sectors enabling us deliver service directly tailored needs please take time search website wwwbmsukcom sales alternatively contact tina vine job originally posted wwwtotaljobscom jobseeking businessaccountmanager_job
True labels: jobs,sales
Predicted labels: jobs,sales

Title: job title staff nurse rgn rmn nightslocation newton abbeysalary per hourhours part time 22 hours per weekskills nmc registration nursing home rgn rmn staff nurse registered general nurse registered mental health nurse mental health old agejob reference rgnregional recruitment services currently recruiting staff nurse rgn rmn work within medium sized nursing home based newton abbey area role provide high standard nursing care clients suffering old age mental health physical disabilitiesto ensure compliance cqc standards guidelinesto write implement set care plansto administer prescribed medicationto conduct accurate risk assessments ensure information recorded correctly ensure smooth running home night shifts oversee unqualified members staff whilst shift candidate must registered general nurse registered mental health nursemust current nmc pin numbermust able work nights part time basismust previous experience working within nursing home setting must previous experience working clients suffer old age mental health physical disabilities must passion caring others package competitive salary generous package excellent benefits befits one uks prestigious organisations highly competitive holiday package selection benefits excellent working environment promotion opportunities high levels job security great career pathway combined clinical skills development secure supportive working environmentto considered opportunity please apply directly website send cv us us directly alexhowarthregionalrecruitmentcom would like speak us detail applying please call alex howarth danielle fyfe quoting reference rgn position advertised behalf regional recruitment services ltd also variety permanent positions available ranging care assistants care home managers care jobs charge nurse jobs childrens nurses clinical lead nurses clinical nurses clinical nurse specialists community childrens nurses community mental health community sisters community staff nurses community workers deputy care managers deputy ward managers district nurses team leaders emergency nurse posts hdu nurse positions health care assistants home manager jobs icu nurse lead nurse midwife modern matron neonatal staff nurse jobs nurse advisors nurse team leader posts nursing auxiliary nvq assessor occupational health nurses occupational therapists oncology nurses paediatric nurses practice nurses recovery nurses registered general nurse posts registered nurse posts residential adult care jobs residential child care jobs rgns rmns rnlds school nurses scrub nurses senior sisters social care posts social worker positions staff nurse e support workers theatre manager posts theatre nurses theatre practitioners theatre support workers ward managers ward sister posts
True labels: healthcare,jobs,nursing
Predicted labels: healthcare,jobs,nursing

Title: dynamic international development charity recruiting community fundraising manager play key part delivery charitys fundraising strategy take lead developing community fundraising programme including expanding appeal programme charitys volunteer network key day day duties within position include work closely direct marketing manager deliver fundraising appeal schools churches including undertaking evaluation previous appeals developing plan increase income community dm activity take lead developing high value support schools churches community groups develop grow friends network including building effective relationships existing groups individual community fundraisers recruiting new supporters increasing overall support important ambassadors develop charitys speaker network contribute development community fundraising strategy undertake review analysis past activity identify areas potential growth identify new business opportunities community fundraising demonstrate excellent relationship management key community groups individuals coordinate supervise thank reactivation recruitment calling process undertaken volunteer including evaluating impact calls make manage community fundraising budget effectively monitoring performance budget monthly basis contributing regular reforecasting line management teams fundraising assistant successful applicant following skills experience educated degree level equivalent significant demonstrable experience working community fundraising role experience managing mass communications direct mail appeals experience effectively building managing relationships range groups individuals experience working income targets managing expenditure budgets experience working schools faith groups volunteers community groups experience managing volunteers experience giving presentations representing organisation range events proven excellent project management skills proven excellent relationship management skills excellent written communication skills including excellent attention detail confident effective verbal communication skills closing date 28th january 2013 interested role wish register tpp hear future posts please send cv fundraisingtppcouk try get touch applications interest however due volume applications receive isnt always possible plus free training fundraisers fundraisers secure role tpp receive cpd voucher use institute fundraising details available tpp profit website http wwwtppcouk cpdvoucher
True labels: charity,jobs,voluntary
Predicted labels: charity,jobs,voluntary
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score
from functools import partial
def print_evaluation_scores(y_val, predicted):
    scores = {
        "accuracy": accuracy_score,
        "f1 macro": partial(f1_score, average="macro"),
        "f1 micro": partial(f1_score, average="micro"),
        "f1 weighted": partial(f1_score, average="weighted"),
        "average precision macro": partial(average_precision_score, average="macro"),
        "average precision micro": partial(average_precision_score, average="micro"),
        "average precision weighted": partial(average_precision_score, average="weighted"),
    }
    for name, score in scores.items():
        print(name, score(y_val, predicted))
print('Bag-of-words')
print_evaluation_scores(y_test, y_test_predicted_labels_mybag)
print('Tfidf')
print_evaluation_scores(y_test, y_test_predicted_labels_tfidf)
Bag-of-words
accuracy 0.6055909079654345
f1 macro 0.5892786511201468
f1 micro 0.821501106257446
f1 weighted 0.8052453113207757
average precision macro 0.40757959686158407
average precision micro 0.6880770264936487
average precision weighted 0.7078709573392094
Tfidf
accuracy 0.5813380543244113
f1 macro 0.5167656028586087
f1 micro 0.8147967431880309
f1 weighted 0.7844110301221876
average precision macro 0.35677602112734996
average precision micro 0.6826627731365503
average precision weighted 0.6938637800525888
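The gap between the macro and micro F1 scores above reflects tag imbalance. A tiny made-up example shows why a poorly-predicted rare tag hurts macro F1 much more than micro F1:

```python
import numpy as np
from sklearn.metrics import f1_score

# 4 samples, 2 tags; the rare second tag is never predicted
y_true = np.array([[1, 0], [1, 0], [1, 0], [1, 1]])
y_pred = np.array([[1, 0], [1, 0], [1, 0], [1, 0]])

# macro averages per-tag F1: (1.0 + 0.0) / 2 = 0.5
macro = f1_score(y_true, y_pred, average='macro', zero_division=0)
# micro pools all tp/fp/fn, so the common tag dominates: 2*4 / (2*4 + 0 + 1)
micro = f1_score(y_true, y_pred, average='micro', zero_division=0)
print(macro, micro)
```

This is why rare categories such as 'gas' or 'voluntary' drag macro scores down while barely moving micro scores.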
We'll use weighted F1 as the evaluation measure and experiment with the regularization coefficient C of L1-regularized Logistic Regression over a grid of values (0.1 to 1.0 in steps of 0.1).
import matplotlib.pyplot as plt
hypers = np.arange(0.1, 1.1, 0.1)
res = []
for h in hypers:
temp_model = train_classifier(X_train_tfidf, y_train, C=h, model='lr')
temp_pred = f1_score(y_test, temp_model.predict(X_test_tfidf), average='weighted')
res.append(temp_pred)
plt.figure(figsize=(7,5))
plt.plot(hypers, res, color='blue', marker='o')
plt.grid(True)
plt.xlabel('Parameter $C$')
plt.ylabel('Weighted F1 score')
plt.show()
# Final model
C = 1.0
classifier = train_classifier(X_train_tfidf, y_train, C=C, model='lr')
# Results
test_predictions = classifier.predict(X_test_tfidf)
test_pred_inversed = mlb.inverse_transform(test_predictions)
def print_words_for_tag(classifier, tag, tags_classes, index_to_words, all_words):
    """
    classifier: trained OneVsRestClassifier
    tag: particular tag
    tags_classes: array of class names from MultiLabelBinarizer
    index_to_words: index -> word mapping
    all_words: all words in the dictionary
    return: nothing, just print the top positive and top negative words for the tag
    """
    print('Tag:\t{}'.format(tag))
    tag_n = np.where(tags_classes == tag)[0][0]
    model = classifier.estimators_[tag_n]
    # coefficients sorted ascending: the last entries are the most positive words
    sorted_idx = model.coef_.argsort().tolist()[0]
    top_positive_words = [index_to_words[x] for x in sorted_idx[-8:]]
    top_negative_words = [index_to_words[x] for x in sorted_idx[:8]]
    print('Top positive words:\t{}'.format(', '.join(top_positive_words)))
    print('Top negative words:\t{}\n'.format(', '.join(top_negative_words)))
print_words_for_tag(classifier, 'engineering', mlb.classes_, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier, 'healthcare', mlb.classes_, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier, 'sales', mlb.classes_, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier, 'scientific', mlb.classes_, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier, 'construction', mlb.classes_, tfidf_reversed_vocab, ALL_WORDS)
Tag: engineering
Top positive words: calco, position candidates, wwwtotaljobscom, alecto, technologies ltd, introduction, pound k, uk skills
Top negative words: contact recruiter, consultancy job, wwwcwjobscouk, posted wwwcareerstructurecom, wwwcareerstructurecom, jobseeking, posted wwwcwjobscouk, care

Tag: healthcare
Top positive words: territory, radiographer, goc, compass associates, optometrist, agency advertises, cares job, employer details
Top negative words: handson nursing, developer, engineer, firm, ever need, retail, bonuses loyalty, reference jo

Tag: sales
Top positive words: repairs capital, posted wwwtotaljobscom, agency defined, following criteriaeducated, equivalentsmichael, wwwsalestargetcouk jobseeking, posted wwwsalestargetcouk, wwwsalestargetcouk
Top negative words: wwwcwjobscouk, bms leading, wwwretailchoicecom, following criteria, wwwcaterercom, qualified, engineer, chef

Tag: scientific
Top positive words: agency employment, science, allied recruitment, field hays, hearing aid, scientific, populus, team24
Top negative words: apply online, removed, school, engineer, high, developer, financial, social

Tag: construction
Top positive words: cscs, cpcs, vertu, energy talent, wwwcareerstructurecom, wwwcareerstructurecom jobseeking, twittercom motortradejobs, posted wwwcareerstructurecom
Top negative words: uk skills, technology, children, manager sales, care, amp, server, calco
In the final part we analyze individual predictions with LIME. We reload the data, this time restricting it to four frequent categories and treating the task as single-label multiclass classification.
dataset = pd.read_csv('Train_rev1.csv', error_bad_lines=False, engine="python")
# keep only the 2nd to 5th most frequent categories (single-label, 4 classes)
dataset = dataset[dataset.Category.isin(dataset.Category.value_counts().index.tolist()[1:5])]
dataset['Category'].value_counts()
Engineering Jobs             25174
Accounting & Finance Jobs    21846
Healthcare & Nursing Jobs    21076
Sales Jobs                   17272
Name: Category, dtype: int64
dataset = dataset[['FullDescription', 'Category']]
# stratified 67/33% random split of dataset
X_train, X_test, y_train, y_test = train_test_split(dataset['FullDescription'].values,
dataset['Category'].values,
stratify= dataset['Category'].values,
test_size=0.33,
random_state=420)
X_train = [text_prepare(x,' ') for x in X_train]
X_test = [text_prepare(x,' ') for x in X_test]
# y_train = [text_prepare(x,' ') for x in y_train]
# y_test = [text_prepare(x,' ') for x in y_test]
idx = 922
display(y_train[idx])
display(X_train[idx])
'Engineering Jobs'
'immediate electrical engineers hvac electricians needed seeking immediately available electrical engineers work docklands primarily working street lighting lighting faults testing inspection works occasional working pump circuit controls well equipment exterior buildings grounds must 17th edition testing inspection certificate working earlies lates shift pattern wwwprsjobscom job originally posted wwwtotaljobscom jobseeking electricalengineer_job'
# !pip3 install lime
import lime
import numpy as np
import sklearn
import sklearn.feature_extraction.text
import sklearn.metrics
# class names must match the sorted order used by the classifier (nb.classes_)
class_names = np.unique(y_train)
# let's use the tfidf vectorizer, commonly used for text.
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(lowercase=False)
train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)
# Use Multinomial Naive Bayes for classification, a fast and common baseline for text.
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB(alpha=.01)
nb.fit(train_vectors, y_train)
MultinomialNB(alpha=0.01)
pred = nb.predict(test_vectors)
sklearn.metrics.f1_score(y_test, pred, average='weighted')
0.9333758658310932
from lime import lime_text
from sklearn.pipeline import make_pipeline
c = make_pipeline(vectorizer, nb)
print(c.predict_proba([X_train[idx]]).round(3))
print(y_train[idx])
[[0. 1. 0. 0.]]
Engineering Jobs
from lime.lime_text import LimeTextExplainer
explainer = LimeTextExplainer(class_names=class_names)
exp = explainer.explain_instance(X_train[idx], c.predict_proba, num_features=10, top_labels=5)
print(exp.available_labels())
[1, 3, 2, 0]
print('Explanation for class %s' % class_names[1])
print('\n'.join(map(str, exp.as_list(label=1))))
print()
print('Explanation for class %s' % class_names[2])
print('\n'.join(map(str, exp.as_list(label=2))))
Explanation for class Engineering Jobs
('circuit', -0.0005055037273207403)
('inspection', -0.00048667664420182634)
('electricalengineer_job', -0.0004837421730272294)
('pump', -0.00047322764056142227)
('electrical', -0.0004427655429782297)
('lighting', -0.00043357716891999305)
('edition', -0.000430327198661014)
('earlies', -0.0004115406234869048)
('17th', -0.0004003031555053982)
('engineers', -0.00039012163726526915)
Explanation for class Healthcare & Nursing Jobs
('electricians', -0.0013826400350452404)
('electricalengineer_job', -0.0013042528695080334)
('exterior', -0.0012860932130299943)
('lighting', -0.0012311863798456547)
('engineers', -0.0012120915293569073)
('electrical', -0.001177728219577556)
('circuit', -0.0010346291740168962)
('edition', -0.0010129480165906883)
('wwwprsjobscom', -0.0009745095367927377)
('earlies', 0.0007409230943452269)
exp.show_in_notebook(text=X_train[idx], labels=(exp.available_labels()))